Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)#1260

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:muoneqr-recurrence-mixedquant

Conversation

@dexhunter

Summary

Key Innovations

  1. MuonEq-R — Row-normalizes gradient matrices before Newton-Schulz orthogonalization. Zero-byte cost, ~0.001 BPB improvement.
  2. Depth Recurrence — Layers 4,5 repeated with fully shared MLP weights (zero extra params). ~0.003 BPB improvement.
  3. Mixed Int5/Int6 GPTQ — Hessian sensitivity ranking: 60 int6 + 6 int5 layers for optimal size/quality tradeoff.
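The MuonEq-R idea (innovation 1) can be sketched roughly as below. This is a hypothetical illustration assuming a Muon-style optimizer with the standard quintic Newton-Schulz iteration; the function names and the exact placement of the row normalization are assumptions, not this PR's actual code.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz iteration approximating the orthogonal factor of G
    (coefficients as commonly used in Muon implementations)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation for a smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_eq_r_update(grad: torch.Tensor) -> torch.Tensor:
    # "Eq-R" step: row-normalize the gradient so every row contributes
    # equally before orthogonalization.
    row_norms = grad.norm(dim=1, keepdim=True)
    grad = grad / (row_norms + 1e-7)
    return newton_schulz5(grad)
```

Because the Newton-Schulz output is (approximately) orthogonal regardless of input scale, the row normalization changes only which direction gets orthogonalized, so it adds no bytes to the checkpoint.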

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|-------------|-----------------|------------------|
| 1337 | 5,541 | 106.5 | 1.0939 | 2.51667 | 15,933,457 |
| 42 | 5,530 | 106.7 | 1.0922 | 2.51279 | 15,981,324 |
| 0 | 5,543 | 106.5 | 1.0927 | 2.51394 | 15,960,050 |
| Mean | 5,538 | 106.6 | 1.0929 | 2.51447 | 15,958,277 |

Changes from PR #1218

| | PR #1218 | This PR |
|---|---|---|
| val_bpb | 1.09785 | 1.09290 (−0.00495) |
| val_loss | ~2.526 nats | 2.514 nats (−0.011) |
| Optimizer | Muon | MuonEq-R |
| Depth recurrence | None | Layers 4,5 |
| Mixed quantization | No | 60 int6 + 6 int5 |

Credits

Test plan

  • 3-seed verification (1337, 42, 0) — all pass artifact + time + score
  • All seeds under 16,000,000 bytes
  • Train < 600s, eval < 600s
  • No TTT, no SLOT, no forbidden techniques
  • Rule checker passed (log + script)

…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
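The depth recurrence described above (layers 4 and 5 run with a fully shared MLP, zero extra parameters) could look roughly like this. A minimal sketch assuming a standard pre-norm transformer block; the class and parameter names are illustrative, not the PR's implementation.

```python
import torch
import torch.nn as nn

class RecurrentMLPBlock(nn.Module):
    """Apply the same pre-norm MLP block `repeats` times.

    Extra effective depth comes from reusing the weights, so the
    parameter count is identical to a single block.
    """
    def __init__(self, dim: int, hidden: int, repeats: int = 2):
        super().__init__()
        self.repeats = repeats
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # shared across all repeats
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):      # e.g. "layer 4" and "layer 5"
            x = x + self.mlp(self.norm(x))
        return x
```

The residual connection makes the repeated application behave like two distinct layers at inference time while the checkpoint stores only one set of MLP weights.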
@mikeapedia

Great submission @dexhunter! Did you happen to test muon column norm or row+column norm? I found R+C worked the best with the smaller vocab and I am wondering if that holds here as well.

HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Omrigotlieb added a commit to Omrigotlieb/parameter-golf that referenced this pull request Apr 3, 2026
Row-normalize the gradient update before Newton-Schulz orthogonalization.
From PR openai#1260: ~0.001 BPB free improvement, zero extra parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
… (3-seed mean)

Improves PR openai#1260 (1.0929) by using N_INT6=61 (one more int6 layer)
with a smaller mini runner (21,396 bytes) that creates enough headroom.

3-seed mean: 1.0924 BPB / 2.5133 nats (seeds 42, 0, 7)
All seeds under 16MB (max: 15,996,591 bytes)
No TTT, no SLOT, no eval-time adaptation.

Techniques: MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP),
61 int6 + 5 int5 Hessian-ranked GPTQ, brotli-11 compression.

Built on PR openai#1218 by @clarkkev.
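The Hessian-ranked mixed-precision assignment used here (61 int6 + 5 int5 in this follow-up, 60 + 6 in the original PR) can be sketched as below. This is a hypothetical illustration: the sensitivity proxy (Hessian trace times squared weight norm) is an assumed heuristic, not the PR's exact GPTQ criterion, and the function names are invented for illustration.

```python
import numpy as np

def assign_bit_widths(hessians, weights, n_int6=61):
    """Rank layers by a quantization-sensitivity proxy and assign bit widths:
    the least-sensitive layers drop to int5, the rest stay at int6."""
    scores = [
        np.trace(H) * float((W ** 2).sum())   # assumed sensitivity proxy
        for H, W in zip(hessians, weights)
    ]
    order = np.argsort(scores)                # ascending: least sensitive first
    bits = [6] * len(scores)
    for i in order[: len(scores) - n_int6]:   # demote the cheapest layers
        bits[i] = 5
    return bits
```

Demoting only the layers whose quantization error (as estimated from the Hessian) is smallest is what buys the size headroom with minimal BPB cost.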
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 3, 2026
….0912 (3-seed mean)

WD-quantization synergy: higher weight decay (0.090 vs 0.085) compresses
5% better, creating headroom for ALL 66 layers at int6 precision.
The extra quantization quality more than recovers the WD BPB cost.

3-seed mean: 1.0912 BPB / 2.5106 nats (seeds 42, 0, 1337)
All seeds under 16MB with 32K+ margins.
No TTT, no SLOT, no eval-time adaptation.

Built on PR openai#1218 by @clarkkev. Improves PR openai#1260 (1.0929) by 0.0017 BPB.
